Extracts BitString to code points array to BitString class by mward-sudo · Pull Request #721 · bartblast/hologram

mward-sudo · 2026-02-19T01:12:54Z

Closes #720

Dependencies

Please note that this PR includes commits from the PR(s) it depends on. Once those PRs are merged into the dev branch, this PR will be rebased so that it contains only its own commits. It will remain in draft until then.

@coderabbitai Do not review this code while the PR is in draft.

Summary by CodeRabbit

Refactor
- Standardized UTF‑8 validation and handling across character processing.
- Improved handling for invalid or truncated UTF‑8 sequences and more reliable codepoint conversion.
- Enhanced Unicode normalization paths (NFC, NFD, NFKC, NFKD).
Tests
- Expanded test coverage for UTF‑8 decoding, validation, truncation detection, and codepoint conversion.

coderabbitai · 2026-02-19T01:13:14Z

Walkthrough

Centralizes UTF‑8 validation and decoding into Bitstring by adding static UTF‑8 helpers (decoding, validation, truncation detection, and codepoint conversion) and refactors assets/js/erlang/unicode.mjs to use these helpers; adds extensive tests for the new Bitstring behavior.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| UTF‑8 Utilities in Bitstring `assets/js/bitstring.mjs` | Added static UTF‑8 helpers: `decodeUtf8CodePoint`, `getValidUtf8Length`, `isValidUtf8CodePoint`, `isValidUtf8ContinuationByte`, `isValidUtf8Sequence`, `isTruncatedUtf8Sequence`, `toCodepointArray` (plus related sequence-length logic). |
| Unicode module refactor `assets/js/erlang/unicode.mjs` | Removed in-file UTF‑8 helpers and replaced usages with Bitstring utilities across character conversion and normalization flows (NFC/NFD/NFKC/NFKD); internal logic changed, public APIs unchanged. |
| Tests for Bitstring UTF‑8 `test/javascript/bitstring_test.mjs` | Added comprehensive tests covering decoding, sequence-length, valid-length detection, truncation, continuation checks, codepoint array conversion, and edge cases (overlongs, surrogates, out-of-range, truncation). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Extract UTF-8 sequence length detection to Bitstring class #706 — Extends Bitstring with UTF‑8 decoding/validation utilities that align with the current extraction of sequence-length logic.
Port :unicode.characters_to_nfc_binary/1 to JS #544 — Adds binary↔codepoint helpers and in-file UTF‑8 validation that overlap the refactoring in unicode.mjs to use Bitstring.
Port :unicode.characters_to_nfc_list/1 to JS #580 — Modifies UTF‑8 decoding/validation used by assets/js/erlang/unicode.mjs; changes likely intersect with this centralization.

Suggested reviewers

bartblast

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: extracting BitString-to-codepoints-array functionality into the BitString class as a static method. |
| Linked Issues check | ✅ Passed | The PR fulfills issue #720 by successfully extracting toCodepointArray(bitstring) as a static BitString class method, along with supporting UTF-8 validation utilities. |
| Out of Scope Changes check | ✅ Passed | All changes are in scope: BitString.mjs additions (UTF-8 utilities and toCodepointArray), unicode.mjs refactoring to use new BitString methods, and comprehensive test coverage. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

assets/js/bitstring.mjs (1)
255-269: Consider hoisting lookup objects to module/class level.

firstByteMasks (Line 259) and minValueForLength (Line 623) are allocated on every call. Since these are constants, they could be static class fields or module-level constants to avoid repeated allocation.
♻️ Suggested refactor
 export default class Bitstring {
   static #decoder = ERTS.utf8Decoder;
   static #encoder = new TextEncoder("utf-8");
+  static #utf8FirstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
+  static #utf8MinCodePointForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
Then reference $.#utf8FirstByteMasks[length] and $.#utf8MinCodePointForLength[encodingLength] respectively.
Also applies to: 617-633
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 255 - 269, The function
decodeUtf8CodePoint allocates the lookup object firstByteMasks on every call
(and similarly minValueForLength elsewhere), so hoist these constant maps to
module/class scope as static fields to avoid repeated allocation; create
constants (e.g. utf8FirstByteMasks and utf8MinCodePointForLength) at the top of
the module or as private static fields on the Bitstring class and replace local
usages of firstByteMasks and minValueForLength with references to those
static/module constants (e.g., Bitstring.#utf8FirstByteMasks[length] or the
module-level utf8FirstByteMasks[length]).
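
For illustration, the hoisting suggested above could look roughly like the following. This is a minimal sketch, not the code from the PR: the class name `Utf8Tables` and the methods `decodeCodePoint` and `isOverlong` are hypothetical stand-ins, and the decoder assumes the caller has already validated the lead and continuation bytes. Only the two table literals are taken from the suggested refactor.

```javascript
// Hypothetical stand-in for the Bitstring class; only the hoisting pattern matters here.
class Utf8Tables {
  // Allocated once, when the class is evaluated, instead of on every call.
  static #firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
  static #minCodePointForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};

  // Decodes one code point of the given encoded length starting at `start`.
  // Assumption: lead and continuation bytes were validated by the caller.
  static decodeCodePoint(bytes, start, length) {
    if (length === 1) return bytes[start];

    // Keep only the payload bits of the lead byte.
    let codePoint = bytes[start] & Utf8Tables.#firstByteMasks[length];

    // Each continuation byte contributes its low 6 bits.
    for (let i = 1; i < length; i++) {
      codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f);
    }

    return codePoint;
  }

  // A sequence is overlong when the decoded value would fit in a shorter encoding.
  static isOverlong(codePoint, length) {
    return codePoint < Utf8Tables.#minCodePointForLength[length];
  }
}
```

With this shape, `Utf8Tables.decodeCodePoint([0xe2, 0x82, 0xac], 0, 3)` decodes the euro sign, and the two-byte sequence `0xc0 0x80` decodes to 0, which `isOverlong(0, 2)` flags.
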
assets/js/erlang/unicode.mjs (1)
271-354: Consider extracting a shared normalization handler parameterized by form.

The characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 functions are structurally identical — they differ only in the normalization form string ("NFC", "NFD", "NFKC", "NFKD"). The handleInvalidUtf8, handleConversionError, and validateListRest helpers are duplicated verbatim across all four.

Since this PR already modifies the handleInvalidUtf8 in each variant (switching to Bitstring.getValidUtf8Length), it would be a natural time to extract a shared factory:
♻️ Sketch
const makeNormalizationBinaryFn = (form) => (data) => {
  const validateListRest = (rest) => { /* shared */ };
  const handleConversionError = (tag, prefix, rest) => {
    // ...textPrefix.normalize(form)...
  };
  const handleInvalidUtf8 = (bytes) => {
    // ...validText.normalize(form)...
  };
  // ...main logic identical...
};

"characters_to_nfc_binary/1": makeNormalizationBinaryFn("NFC"),
"characters_to_nfd_binary/1": makeNormalizationBinaryFn("NFD"),
"characters_to_nfkc_binary/1": makeNormalizationBinaryFn("NFKC"),
"characters_to_nfkd_binary/1": makeNormalizationBinaryFn("NFKD"),
Also applies to: 501-583, 587-668, 672-754
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 271 - 354, The four functions
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 duplicate the same
helpers (validateListRest, handleConversionError, handleInvalidUtf8) and differ
only by the normalization form string; extract a factory like
makeNormalizationBinaryFn(form) that returns the function implementing the
shared logic and uses the form when calling String.prototype.normalize (i.e.,
replace hardcoded "NFC" with the parameter), then replace each of the four
exported entries with makeNormalizationBinaryFn("NFC"),
makeNormalizationBinaryFn("NFD"), makeNormalizationBinaryFn("NFKC"), and
makeNormalizationBinaryFn("NFKD") respectively, keeping references to
Erlang_Unicode["characters_to_binary/3"], Bitstring.toText,
Bitstring.getValidUtf8Length, Type.bitstring, and Type.tuple unchanged inside
the factory.
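
A runnable reduction of the factory idea is sketched below. It is deliberately simplified and rests on assumptions: it works on plain JavaScript strings rather than Hologram's boxed terms, and it leaves out `validateListRest`, `handleConversionError`, and `handleInvalidUtf8` entirely. The name `makeNormalizationFn` and the `normalizers` table are hypothetical; the only real API used is `String.prototype.normalize`.

```javascript
// Hypothetical, simplified factory: the normalization form is the only parameter.
// Error handling and Erlang term boxing from the real functions are omitted.
const makeNormalizationFn = (form) => (text) => text.normalize(form);

// One entry per variant, all sharing the same implementation.
const normalizers = {
  "characters_to_nfc_binary/1": makeNormalizationFn("NFC"),
  "characters_to_nfd_binary/1": makeNormalizationFn("NFD"),
  "characters_to_nfkc_binary/1": makeNormalizationFn("NFKC"),
  "characters_to_nfkd_binary/1": makeNormalizationFn("NFKD"),
};

// "e" followed by a combining acute accent composes to a single code point under NFC.
const composed = normalizers["characters_to_nfc_binary/1"]("e\u0301");

// The precomposed form decomposes back to two code points under NFD.
const decomposed = normalizers["characters_to_nfd_binary/1"]("\u00e9");
```

The point of the design is that a fix to the shared logic, such as the switch to Bitstring.getValidUtf8Length made in this PR, would then be made once instead of four times.
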

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 659-678: isTruncatedUtf8Sequence can return true when start >=
bytes.length because bytes[start] is undefined; add an early guard in
isTruncatedUtf8Sequence to return false if start is out of range (start < 0 or
start >= bytes.length) so you never read bytes[start]; keep the rest of the
logic intact (using $.getUtf8SequenceLength and $.isValidUtf8ContinuationByte)
after this check to correctly detect only true truncated sequences.
- Around line 832-839: toCodepointArray currently calls maybeSetTextFromBytes
which can set bitstring.text to false on UTF-8 decode failure, and then
Array.from(false, ...) silently returns []; change to explicitly guard after
maybeSetTextFromBytes in static toCodepointArray(bitstring): if bitstring.text
=== false (or not a string) throw a descriptive TypeError (or return null per
API decision) indicating invalid UTF-8 bitstring so callers don't get an empty
array silently; reference the toCodepointArray method and maybeSetTextFromBytes
when adding the check and the error message.
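
Both inline comments ask for a guard clause, which might be sketched as follows. Everything here is an assumption made for illustration: the helper bodies are not the PR's code, `getUtf8SequenceLength` is a simplified version that does not reject lead bytes such as 0xC0, 0xC1, or 0xF5 and above, and `toCodepointArray` takes the decoded text directly (standing in for `bitstring.text`) and returns plain numbers.

```javascript
// Hypothetical helpers; names mirror the review comments, bodies are assumptions.
const isContinuationByte = (byte) => (byte & 0xc0) === 0x80;

// Expected sequence length from the lead byte, or false for an invalid lead byte.
// Simplified: does not reject 0xC0, 0xC1, or 0xF5 and above.
function getUtf8SequenceLength(leadByte) {
  if ((leadByte & 0x80) === 0x00) return 1;
  if ((leadByte & 0xe0) === 0xc0) return 2;
  if ((leadByte & 0xf0) === 0xe0) return 3;
  if ((leadByte & 0xf8) === 0xf0) return 4;
  return false;
}

function isTruncatedUtf8Sequence(bytes, start) {
  // Guard: never read bytes[start] when start is outside the buffer.
  if (start < 0 || start >= bytes.length) return false;

  const expectedLength = getUtf8SequenceLength(bytes[start]);
  if (expectedLength === false) return false;

  // All expected bytes are present, so the sequence is not truncated.
  if (start + expectedLength <= bytes.length) return false;

  // Truncated only if every byte present after the lead is a continuation byte.
  for (let i = start + 1; i < bytes.length; i++) {
    if (!isContinuationByte(bytes[i])) return false;
  }
  return true;
}

function toCodepointArray(text) {
  // `text` stands in for bitstring.text, which may be false after a failed decode.
  // Without this check, Array.from(false, ...) would return [] silently.
  if (typeof text !== "string") {
    throw new TypeError("Cannot convert bitstring to code points: invalid UTF-8");
  }
  return Array.from(text, (char) => char.codePointAt(0));
}
```

Whether the second guard should throw or return a sentinel value is the API decision the review comment leaves open; the sketch throws only to make the failure visible.
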

mward-sudo · 2026-02-19T01:23:19Z

@coderabbitai do not review this code again while it is in draft. Also, please change the summary to include only changes from the latest commit in this PR.

coderabbitai · 2026-02-19T01:23:43Z

@mward-sudo, understood! I will not review this PR while it remains in draft status.

I'll regenerate the summary to include only the changes from the latest commit in this PR.

Please regenerate the summary to include only the changes from the latest commit in PR #721, excluding changes from the dependent PRs (#710, #715, #716, #717, #718, #719).

✏️ Learnings added

Learnt from: mward-sudo
URL: https://github.com/bartblast/hologram/pull/721

Timestamp: 2026-02-19T01:23:42.278Z
Learning: User mward-sudo prefers that draft PRs should not be automatically reviewed by CodeRabbit until the draft status is removed.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

✅ Actions performed

Summary regeneration triggered.

mward-sudo added 7 commits February 18, 2026 22:39

Extracts UTF-8 code point decoding to the BitString utility class, adds parameter validation

9a49bec

Extracts UTF-8 continuation byte validation to BitString class

105d2d9

Extracts Utf-8 code point validation to BitString class

7015625

Extracts UTF-8 sequence validation to BitString class

b9bfe2d

Extracts truncated UTF-8 sequence validation to BitString class

9ab7bc3

Extracts valid UTF-8 sequence length to BitString class

f458bf4

Extracts BitString to code points array to BitString class

8d3af8d

coderabbitai bot reviewed Feb 19, 2026


mward-sudo mentioned this pull request Feb 19, 2026

Extracts BitString from code point to BitString class #723

Draft

7 tasks

mward-sudo wants to merge 7 commits into bartblast:dev from mward-sudo:02-19-extracts_bitstring_to_code_points_array_to_bitstring_class
